Abstract

We consider the problem of allocating radio channels to links in a wireless network. Links interact
through interference, modelled as a conflict graph (i.e., two interfering links cannot be simultaneously
active on the same channel). We aim at identifying the channel allocation maximizing the total network
throughput over a finite time horizon. Should we know the average radio conditions on each
channel and on each link, an optimal allocation would be obtained by solving an Integer Linear Program
(ILP). When radio conditions are unknown a priori, we look for a sequential channel allocation
policy that converges to the optimal allocation while minimizing on the way the throughput loss or
regret due to the need for exploring sub-optimal allocations. We formulate this problem as a generic
linear bandit problem, and analyze it first in a stochastic setting where radio conditions are driven by
a stationary stochastic process, and then in an adversarial setting where radio conditions can evolve
arbitrarily. We provide, in both settings, algorithms whose regret upper bounds outperform those of
existing algorithms for linear bandit problems.

1 Introduction

Spectrum is a key and scarce resource in wireless communication systems, and it remains tightly controlled
by regulation authorities. Most of the frequency bands are exclusively allocated to a single system licensed
to use it everywhere and for periods of time that usually cover one or two decades. The consensus on
this rigid spectrum management model is that it leads to significant inefficiencies in spectrum use. The
explosion of demand for broadband wireless services also calls for more flexible models where much larger
spectrum parts could be dynamically shared among users in a fluid manner. In such models, Dynamic
Spectrum Access (DSA) techniques will play a major role. These techniques make it possible for radio
devices to become frequency-agile, i.e. able to rapidly and dynamically access bands of a wide spectrum
part.

In this paper, we consider wireless networks where transmitters can share a potentially large number 
of frequency bands or channels for transmission. In such networks, transmitters should be able to select a 
channel (i) that is not selected by neighbouring transmitters to avoid interference, and (ii) that offers good 
radio conditions. A spectrum allocation is defined by the channels assigned to the various transmitters 
or links, and our fundamental objective is to devise an optimal allocation, i.e., maximizing the network- 
wide throughput. If the radio conditions on each link and on each channel were known, the problem 
would reduce to a combinatorial optimization problem, and more precisely to an Integer Linear Program.
For example, if all links interfere with each other (no two links can be active on the same channel), a case
referred to as full interference, the optimal spectrum allocation problem is an instance of a Maximum 
Weighted Matching in a bipartite graph (vertices on one side correspond to links and vertices on the 
other side to channels; the weight of an edge, i.e., a (link, channel) pair, represents the radio conditions 
for the corresponding link and channel). In practice, the radio conditions on the various channels are not 
known a priori, and they evolve over time in an unpredictable manner. Hence, we need to dynamically 
learn and track the optimal spectrum allocation. This task is further complicated by the fact that we 
can gather information about the radio conditions for a particular (link, channel) pair only by actually 
including this pair in the selected spectrum allocation. We face a classical exploration vs. exploitation 
trade-off problem: we need to exploit the spectrum allocation with highest throughput observed so far
whilst constantly exploring whether this allocation changes over time. We model our sequential spectrum 
allocation problem as a linear multi-armed bandit problem. The challenge in this problem resides in the 
very high dimension of the decision action space, i.e., in its combinatorial structure: the size of the set 
of possible allocations exponentially grows with the number of links and channels. 

We study generic linear bandit problems in two different settings, and apply our results to sequential 
spectrum allocation problems. In the stochastic setting, we assume that the radio conditions for each 
(link, channel) pair evolve over time according to a stationary (actually i.i.d.) process whose mean is 
unknown. This first model is instrumental to represent scenarios where the average radio conditions 
evolve relatively slowly, in the sense that the spectrum allocation can be updated many times before this 
average exhibits significant changes. In the adversarial setting, the radio conditions evolve arbitrarily, as 
if they were generated by an adversary. This model is relevant when the channel allocation cannot be 
updated at the same pace as radio conditions change. In both settings, as usual for bandit optimization 
problems, we measure the performance of a given sequential decision policy through the notion of regret, 
defined as the difference of the performance obtained over some finite time horizon under the best static 
policy (i.e., assuming here that average radio conditions are known) and under the given policy. We make 
the following contributions: 

• For stochastic linear bandit problems:

(a) We derive an asymptotic lower bound for the regret of any sequential decision policy, and show
that this bound typically scales as $(n \times c)\log(T)$, where $n$, $c$, and $T$ denote the number of links,
the number of channels and the time horizon, respectively.

(b) We propose two sequential decision policies for linear bandit problems, that are simple extensions
of the classical UCB and $\epsilon$-greedy algorithms, and provide upper bounds on their regret, scaling
respectively as $(n^3 \times c)\log(T)$ and $(n^2 \times c)\log(T)$. The latter constitutes the best regret upper
bound for the linear bandit problem considered.

• For adversarial linear bandit problems: We propose ColorBand, a new sequential decision policy,
and derive an upper bound on its regret. For example in the full interference case, when the numbers
of channels and links are identical, this bound scales as $n^{3/2}\sqrt{\log(n)T}$ (this improves over the upper
bounds for the best previously known algorithms [6, 12], whose dependence on $n$ is of higher order).

Related work. Spectrum allocation has attracted considerable attention recently, mainly due to the 
increasing popularity of cognitive radio systems. In such systems, transmitters have to explore spectrum 
to find frequency bands free from primary users. This problem can also be formulated as a bandit 
problem, see e.g. [1, 13], but is simpler than our problem (in cognitive radio systems, there are basically $c$
unknown variables, each representing the probability that a channel is free). Spectrum sharing problems
similar to ours have been very recently investigated in [11, 16]. Both aforementioned papers restrict their
analysis to the case of full interference, and even in this scenario, we obtain better regret bounds. As far
as we know, adversarial bandit problems have not been considered to model spectrum allocation issues.
There is a vast literature on bandit problems, both in the stochastic and adversarial settings, see [s] for 
a quick survey. Surprisingly, there is very little work on linear bandits with discrete action spaces in the
stochastic setting, and existing results are derived for very simple problems only.
In contrast, the problem has received more attention in the adversarial setting. The
algorithm we devise yields a regret upper bound that beats all known bounds of algorithms previously 
proposed in the literature. 

2 Models and Objectives 

2.1 Network and interference model 

Consider a network consisting of $n$ links indexed by $i \in [n] = \{1, \ldots, n\}$. Each link can use one of the
$c$ available radio channels indexed by $j \in [c]$. Interference is represented as a conflict graph $G = (V, E)$
where vertices are links, and $(i, i') \in E$ if links $i$ and $i'$ interfere, i.e., these links cannot be
simultaneously active. A spectrum allocation is represented as a configuration $M \in \{0,1\}^{n \times c}$, where
$M_{ij} = 1$ if and only if link-$i$ transmitter uses channel $j$. $M$ is feasible if (i) for all $i$, the corresponding
transmitter uses at most one channel, i.e., $\sum_{j \in [c]} M_{ij} \in \{0, 1\}$; (ii) two interfering links cannot be active
on the same channel, i.e., for all $i, i' \in [n]$, $(i, i') \in E$ implies for all $j \in [c]$, $M_{ij} M_{i'j} = 0$.$^1$ Let $\mathcal{M}$ be the
set of feasible configurations. For $M \in \mathcal{M}$, if link $i$ is active, we denote by $M(i)$ the channel allocated to
this link. We also write $(i, j) \in M$ for $i \in [n]$ and $j \in [c]$, if link $i$ is active under configuration $M$, and
$j = M(i)$. In the following we denote by $\mathcal{K} = \{\mathcal{K}_\ell, \ell \in [k]\}$ the set of maximal cliques of the interference
graph $G$. We also introduce $K_{\ell i} \in \{0, 1\}$ such that $K_{\ell i} = 1$ if and only if link $i$ belongs to the maximal
clique $\mathcal{K}_\ell$.

We will pay particular attention to the full interference case, where the conflict graph $G$ is complete.
In such a case, a feasible configuration M is just a matching in the complete bipartite graph ([n], [c]), 
where on one side we have the set [n] of links, and on the other side, the set [c] of radio channels. 
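
The two feasibility conditions above translate directly into a check on a 0/1 matrix. The following is a minimal sketch, not part of the original paper: the function names `is_feasible` and `complete_conflict_graph` are hypothetical, links and channels are indexed from 0, and the conflict graph is assumed to be given as a list of interfering link pairs.

```python
from itertools import combinations

def is_feasible(M, edges):
    """Check conditions (i) and (ii) for a configuration M (list of 0/1 rows).

    M[i][j] == 1 means link i uses channel j; edges lists interfering pairs (i, i2).
    """
    c = len(M[0]) if M else 0
    # (i) every link uses at most one channel
    if any(sum(row) > 1 for row in M):
        return False
    # (ii) two interfering links are never active on the same channel
    for i, i2 in edges:
        if any(M[i][j] * M[i2][j] for j in range(c)):
            return False
    return True

def complete_conflict_graph(n):
    """Edge list of the full interference case: every pair of links interferes."""
    return list(combinations(range(n), 2))

# Full interference with n = c = 3: a feasible configuration is a matching.
matching = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
clash = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]  # links 0 and 1 share channel 0
```

With full interference, `matching` passes and `clash` fails; `clash` becomes feasible again if links 0 and 1 do not interfere.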

2.2 Fading 

To model the way radio conditions evolve over time on the various channels, we consider a time slotted
system, where the duration of a slot corresponds either to the transmission of a single packet or to that of
a fixed number $m$ of packets. The channel allocation, i.e., the chosen feasible configuration, may change
at the beginning of each slot. We denote by $r_{ij}(t)$ the number of packets successfully transmitted during
slot $t$ when link-$i$ transmitter selects channel $j$ for transmission in this slot and in absence of interference.
Depending on the ability of transmitters to switch channels, we introduce two settings.

In the stochastic setting, the numbers of successful packet transmissions $r_{ij}(t)$ on link $i$ and channel
$j$ are independent over $i$ and $j$, and are i.i.d. across slots $t$. The average number of successful packet
transmissions per slot is denoted by $E[r_{ij}(t)] = \theta_{ij}$, and is supposed to be unknown initially. If $m$ packets
are sent per slot, $r_{ij}(t)$ is a random variable whose distribution is that of $Y_{ij}/m$ where $Y_{ij}$ has a binomial
distribution $\mathrm{Bin}(m, \theta_{ij})$. When $m = 1$, $r_{ij}(t)$ is a Bernoulli random variable of mean $\theta_{ij}$. The stochastic
setting models scenarios where the radio channel conditions are stationary.

In the adversarial setting, $r_{ij}(t) \in [0, 1]$ can be arbitrary (as if it was generated by an adversary), and
unknown in advance. This setting is useful to model scenarios where the duration of a slot is comparable
to or smaller than the channel coherence time. In such scenarios, we assume that the channel allocation
cannot change at the same pace as the radio conditions on the various links, which is of interest in
practice, when the radios cannot rapidly change channels.

$^1$This model assumes that the interference graph is the same over the various channels. Our analysis and results can be
extended to the case where one has different interference graphs depending on the channel.

In the following, we denote by $r_M(t)$ the total number of packets successfully transmitted during slot
$t$ under feasible configuration $M \in \mathcal{M}$, i.e.,

$$r_M(t) = \sum_{i \in [n]} \sum_{j \in [c]} M_{ij} r_{ij}(t) = M \bullet r(t).$$

2.3 Channel allocations and objectives 

We analyze the performance of adaptive spectrum allocation policies that may select different feasible
configurations at the beginning of each slot, depending on the observed received throughput under the
various configurations used in the past. More precisely, at the beginning of each slot $t$, under policy $\pi$,
a feasible configuration $M^\pi(t) \in \mathcal{M}$ is selected. This selection is made based on some feedback on the
previously selected configurations and their observed throughput. We consider two types of feedback.

Under detailed feedback, at the end of slot $t$, the numbers of packets successfully transmitted on the
various links are observed, i.e., the feedback $f(t)$ is $(r_{ij}(t), i, j : M^\pi_{ij}(t) = 1)$. Under aggregate feedback, at
the end of slot $t$, the total number of successfully sent packets is known, and so the feedback $f(t)$ is simply
$r_{M^\pi(t)}(t)$. Aggregate feedback is of interest when we are not able to keep track of the achieved throughput
per link.

At the beginning of slot $t$, the selected configuration $M^\pi(t)$ may depend on past decisions and the
received feedback, i.e., on $M^\pi(1), f(1), \ldots, M^\pi(t-1), f(t-1)$. The chosen configuration can also be
randomized (at the beginning of a slot, we sample a configuration from a given distribution that depends
on past observations). We denote by $\Pi$ the set of feasible policies. The objective is to identify a policy
maximizing over a finite time horizon $T$ the expected number of packets successfully transmitted or
simply what we call the reward. The expectation is here taken with respect to the possible randomness
in the rewards (in the stochastic setting) and in the possibly randomized selection of the successive channel
allocations. Equivalently, we aim at designing a sequential channel allocation policy that minimizes the
regret. The regret of policy $\pi \in \Pi$ is defined by comparing the performance achieved under $\pi$ to that of
an idealised policy that assumes that the average conditions on the various links and channels are known:

$$R^\pi(T) = \max_{M \in \mathcal{M}} E\Big[\sum_{t=1}^{T} r_M(t)\Big] - E\Big[\sum_{t=1}^{T} r_{M^\pi(t)}(t)\Big], \qquad (1)$$

where $M^\pi(t)$ denotes the configuration selected in slot $t$ under policy $\pi$. The notion of regret quantifies the performance
loss due to the need for learning radio channel conditions, and the above problem can be seen as a linear
bandit problem.



3 Optimal Static Allocation 

When evaluating the regret of a sequential spectrum allocation policy, the performance of the latter is
compared to that of the best static allocation:

$$M^* \in \arg\max_{M \in \mathcal{M}} E\Big[\sum_{t=1}^{T} r_M(t)\Big],$$

where in the above formula, the expectation is taken with respect to the possible randomness in the
throughput $r_M(t)$ (in the stochastic setting only). To simplify the presentation, we assume that the
optimal static allocation $M^*$ is unique (the analysis can be readily extended to the case where several
configurations are optimal, but at the expense of more involved notation). To identify $M^*$,
we have to solve an Integer Linear Program (ILP). Let us first introduce the following set of ILPs
parameterized by vector $r = (r_{ij}, i \in [n], j \in [c])$:

$$\max \sum_{i \in [n], j \in [c]} r_{ij} M_{ij} \qquad (2)$$

$$\text{s.t.} \quad \sum_{j \in [c]} M_{ij} \le 1, \quad \forall i \in [n],$$

$$\sum_{i \in [n]} K_{\ell i} M_{ij} \le 1, \quad \forall \ell \in [k], j \in [c],$$

$$M_{ij} \in \{0, 1\}, \quad \forall i \in [n], j \in [c],$$

and denote by $V(r)$ its optimal value. In the stochastic setting, the performance of the best static policy is
then $\mu^* = V(\theta)$, $\theta = (\theta_{ij}, i \in [n], j \in [c])$, whereas in the adversarial setting, the best static policy yields
a reward equal to $V(\sum_{t=1}^{T} r(t))$.

Lemma 1 The ILP problem (2) is NP-complete for general interference graphs.

Indeed our ILP problem is a coloring problem of the interference graph $G$. If one considers all the
links allocated to a given channel, we obtain a stable set of $G$. To be more precise, already with only one
channel, our problem is NP-complete, as when $c = 1$ and $r_{i1} = 1$ for all $i \in [n]$, the optimum value of
(2) is the stable set number of the interference graph $G$, which is an NP-complete problem (Theorem 64.1
in [18]). It should be noticed that in contrast, when the interference graph is complete, i.e., in the full
interference case, the ILP problem can be interpreted as a maximum weighted matching in a bipartite
graph. As a consequence, it can be solved in polynomial time [18].
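
For very small instances, the value $V(r)$ of the ILP can be obtained by plain enumeration, which is useful for checking other implementations. The sketch below is an illustration only and is exponential in $n$; the name `solve_by_enumeration` is hypothetical, indices start at 0, and the interference constraints are expressed per edge of the conflict graph, which for 0/1 configurations is equivalent to the clique constraints of the ILP.

```python
from itertools import product

def solve_by_enumeration(r, edges):
    """Return (V(r), allocation) by enumerating every channel assignment.

    r[i][j] is the weight of the (link i, channel j) pair; edges lists
    interfering link pairs. allocation[i] is the channel of link i, or None
    if link i stays idle.
    """
    n, c = len(r), len(r[0])
    best_value, best_alloc = float("-inf"), None
    for alloc in product([None] + list(range(c)), repeat=n):
        # skip assignments where two interfering links share a channel
        if any(alloc[i] is not None and alloc[i] == alloc[k] for i, k in edges):
            continue
        value = sum(r[i][alloc[i]] for i in range(n) if alloc[i] is not None)
        if value > best_value:
            best_value, best_alloc = value, alloc
    return best_value, best_alloc

# Three links, two channels, full interference: at most two links can be active.
weights = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
full_interference = [(0, 1), (0, 2), (1, 2)]
value, allocation = solve_by_enumeration(weights, full_interference)
```

In the full interference case a polynomial-time maximum weighted matching routine would replace the enumeration.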



4 Stochastic Bandit Problem 

This section is devoted to the analysis of our linear bandit problem in the stochastic setting. We first 
derive an asymptotic lower bound on the regret achieved by any feasible sequential spectrum allocation 
policy. This provides a fundamental performance limit that no policy can beat. We then present two 
policies that naturally extend strategies used in classical multi-armed bandit problems to linear bandit 
problems, and provide upper bounds on their respective regret. Most of the results presented here concern 
scenarios where detailed feedback is available. 



4.1 Detailed feedback 

4.1.1 Asymptotic regret lower bound 

In their seminal paper [14], Lai and Robbins consider the classical multi-armed bandit problem, where
a decision maker has to sequentially select an action from a finite set of $K$ actions whose respective
rewards are independent and i.i.d. across time. For example, when the rewards are distributed according
to Bernoulli distributions of respective means $\theta_1, \ldots, \theta_K$, they show that the regret of any online
action selection policy $\pi$ satisfies the following lower bound:

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge \sum_{i \ne 1} \frac{\theta_1 - \theta_i}{\mathrm{KL}(\theta_i, \theta_1)},$$

where without loss of generality $\theta_1 > \theta_i$ for all $i \ne 1$, and $\mathrm{KL}(u, v)$ is the KL divergence number between
two Bernoulli distributions of respective means $u$ and $v$,
$\mathrm{KL}(u, v) = u \log(u/v) + (1 - u)\log((1 - u)/(1 - v))$. The simplicity of this lower bound is due to the
stochastic independence of the rewards obtained selecting different actions. In our linear bandit problem,
the rewards obtained selecting different configurations are inherently correlated (as in these configurations, 
a link may be allocated with the same channel). Correlations significantly complicate the derivation and 
the expression of the lower bound on regret. To derive such a bound, we use the techniques used in [s] 
to study the adaptive control of Markov chains. 
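
To make the classical Lai and Robbins constant concrete, the short sketch below evaluates the Bernoulli KL divergence number and the resulting sum over sub-optimal arms. It is an illustration only: the function names are hypothetical, and a unique best arm is assumed, as in the text.

```python
import math

def kl_bernoulli(u, v):
    """KL divergence number between Bernoulli(u) and Bernoulli(v)."""
    def term(a, b):
        if a == 0.0:
            return 0.0          # convention 0 * log(0 / b) = 0
        if b == 0.0:
            return math.inf     # a > 0 while b = 0
        return a * math.log(a / b)
    return term(u, v) + term(1.0 - u, 1.0 - v)

def lai_robbins_constant(theta):
    """Sum over sub-optimal arms of (theta_best - theta_i) / KL(theta_i, theta_best)."""
    best = max(theta)
    return sum((best - th) / kl_bernoulli(th, best) for th in theta if th < best)

# Two arms with means 0.5 and 0.25: the regret of a uniformly good policy
# grows at least like lai_robbins_constant([0.5, 0.25]) * log(T).
constant = lai_robbins_constant([0.5, 0.25])
```

The constant grows when arms are hard to tell apart (small KL) relative to their gap, which is the same trade-off that the constrained optimization problem of the linear setting captures.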

We use the following notation: $\Theta = [0,1]^{n \times c}$; $\theta = (\theta_{ij}, i \in [n], j \in [c])$; $\mu^M(\lambda) = M \bullet \lambda$, for any $M \in \mathcal{M}$
and $\lambda \in \Theta$. Recall that $\mu^* = \max_{M \in \mathcal{M}} M \bullet \theta$, and the optimal configuration is $M^*$, i.e., $\mu^* = M^* \bullet \theta$.

We introduce $B(\theta)$ as the set of bad parameters, i.e., the set of $\lambda \in \Theta$ such that configuration $M^*$
provides the same reward as under parameter $\theta$, and yet $M^*$ is not the optimal static configuration:

$$B(\theta) = \{\lambda \in \Theta : (\forall i, j : M^*_{ij} = 1, \lambda_{ij} = \theta_{ij}) \text{ and } \mu^* < \max_{M \in \mathcal{M}} \mu^M(\lambda)\}.$$

Then $B(\theta) = \bigcup_{M \ne M^*} B_M(\theta)$ where

$$B_M(\theta) = \{\lambda \in \Theta : (\forall i, j : M^*_{ij} = 1, \lambda_{ij} = \theta_{ij}) \text{ and } \mu^* < \mu^M(\lambda)\}.$$

The reward distribution for link $i$ under configuration $M$ and parameter $\theta$ is denoted by $p_i(\cdot\,; M, \theta)$. This
distribution is over the set $S = \{0, 1\}$ when one packet is sent per slot, or the set $S = \{0, 1/m, \ldots, 1\}$ if $m$
packets per slot are sent. Of course when $\sum_{j \in [c]} M_{ij} = 0$, we have $p_i(0; M, \theta) = 1$. When
$\sum_{j \in [c]} M_{ij} = 1 = M_{iM(i)}$, if a single packet is sent per slot, we have, for $y_i \in \{0, 1\}$,

$$p_i(y_i; M, \theta) = \theta_{iM(i)}^{y_i} (1 - \theta_{iM(i)})^{1 - y_i},$$

and if $m$ packets are sent, we have, for $y_i \in \{0, 1/m, \ldots, 1\}$,

$$p_i(y_i; M, \theta) = \binom{m}{m y_i} \theta_{iM(i)}^{m y_i} (1 - \theta_{iM(i)})^{m(1 - y_i)}.$$

We define the KL divergence number $\mathrm{KL}^M(\theta, \lambda)$ under static configuration $M$ as:

$$\mathrm{KL}^M(\theta, \lambda) = \sum_{i \in [n]} \sum_{y_i \in S} \log\frac{p_i(y_i; M, \theta)}{p_i(y_i; M, \lambda)} \, p_i(y_i; M, \theta).$$

For instance, when a single packet is sent per slot, we get:

$$\mathrm{KL}^M(\theta, \lambda) = \sum_{i \in [n]} \sum_{j \in [c]} M_{ij} \mathrm{KL}(\theta_{ij}, \lambda_{ij}).$$

As we shall see later in this section, we can identify sequential spectrum allocation policies whose regret scales
as $\log(T)$ when $T$ grows large. Hence we restrict our attention to so-called uniformly good policies: $\pi \in \Pi$
is uniformly good if for all $\theta \in \Theta$, if the configuration $M$ is sub-optimal ($M \ne M^*$), then the number
of times $T_M(t)$ it is selected up to time $t$ satisfies: $E[T_M(t)] = o(t^\gamma)$ for all $\gamma > 0$. We are now ready to
state the regret lower bound.

Theorem 1 For all $\theta \in \Theta$, for all uniformly good policy $\pi \in \Pi$,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge C(\theta), \qquad (3)$$

where $C(\theta)$ solves the following optimization problem:

$$C(\theta) = \inf_{x_M \ge 0, M \in \mathcal{M}} \sum_{M \in \mathcal{M}} x_M (\mu^* - \mu^M(\theta)), \qquad (4)$$

$$\text{s.t.} \quad \inf_{\lambda \in B_M(\theta)} \sum_{Q \ne M^*} x_Q \mathrm{KL}^Q(\theta, \lambda) \ge 1, \quad \forall M \ne M^*. \qquad (5)$$

The above lower bound is unfortunately not explicit. In the case of full interference, however, the
bound can be simplified, and we may characterize how it scales with the numbers of links and channels.

Theorem 2 In the case of full interference, when a single packet is sent per slot, we have

$$C(\theta) = \Theta(n \times c), \quad \text{as } n, c \to \infty.$$

The above theorem states that there exist positive constants $k_1 > 0$, $k_2 > 0$ (that depend on $\theta$) such
that $C(\theta)/(nc) \in [k_1, k_2]$ for $n, c$ large enough. This result is intuitive and means that the regret has to
scale with the number of unknown parameters in the system.

4.1.2 Regret of UCB-like algorithms

We investigate here the performance of a variant of the celebrated UCB algorithm [2]. The idea of this
variant is to attach to each (link, channel) pair an index $q_{ij}$ that evolves over time, and to use these
indexes to sequentially select configurations. A similar UCB-like algorithm has been recently proposed
in [11].

Denote by $\hat r_{ij,s} = \frac{1}{s} \sum_{t=1}^{s} r_{ij}(t)$ the empirical average number of packets successfully sent over link $i$
and channel $j$ if channel $j$ has been allocated $s$ times to link $i$. Introduce the confidence term $p_{s,t}$, which
is parameterized by a constant $a > 0$. The index of the pair (link $i$, channel $j$) is defined by:

$$q_{ij}(t) = \hat r_{ij,T_{ij}(t)} + p_{T_{ij}(t),t},$$

where $T_{ij}(t)$ is the number of times channel $j$ has been allocated to link $i$ up to time $t$. The proposed
variant of UCB algorithm selects at any slot the configuration maximizing the sum of the indexes of the
(link, channel) pairs. To initialize the algorithm, we need to first build an index for each (link, channel)
pair. To do so, we start by selecting a set $\mathcal{A} \subset \mathcal{M}$ of configurations that covers all possible pairs. The
construction of such a set is easy, and for example, in the case of full interference, we can simply use a
set of $\max(n, c)$ configurations or matchings. Let $A$ be the cardinality of $\mathcal{A}$.

Algorithm 1: UCB Algorithm

Initialization: For $t = 1, \ldots, A$, select configurations in $\mathcal{A}$, observe the detailed rewards, and update $q(t)$.

for all $t > A$ do
    Select configuration $M(t) \in \arg\max_{M \in \mathcal{M}} M \bullet q(t)$.
    Observe the detailed rewards, and update the vector of indexes $q(t + 1)$.
end

Theorem 3 For $a > n + 1/2$, we have:

$$R^{\mathrm{UCB}}(T) \le 4a \frac{\Delta_{\max}}{\Delta_{\min}^2} n^3 c \log(T) + O(1), \quad \text{as } T \to \infty,$$

where $\Delta_{\max} = \max_{M \in \mathcal{M}} (\mu^* - \mu^M(\theta))$ and $\Delta_{\min} = \min_{M \ne M^*} (\mu^* - \mu^M(\theta))$.

Note that this bound is valid for any interference graph. For the full interference case, however, it can
be easily shown that the bound on the regret scales as $\frac{\Delta_{\max}}{\Delta_{\min}^2} n c \min(n, c)^2 \log(T)$. This is also consistent
with the result derived in [11] where the analysis was restricted to the case $n \le c$.
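
The index policy above can be simulated on a toy full interference instance. The following is a minimal sketch under stated assumptions rather than the authors' implementation: rewards are Bernoulli (one packet per slot), $n \le c$, the maximization over configurations is done by enumerating all matchings, the covering set is built from cyclic shifts, and the confidence term is assumed to have the classical form $\sqrt{a \log(t)/s}$, which is not confirmed by the text. The name `run_ucb` and its parameters are hypothetical.

```python
import math
import random
from itertools import permutations

def run_ucb(theta, horizon, a=2.0, seed=0):
    """Simulate the UCB-like index policy; returns (expected regret, pair counts)."""
    rng = random.Random(seed)
    n, c = len(theta), len(theta[0])
    # every configuration giving each link its own channel (requires n <= c)
    configs = [tuple(p) for p in permutations(range(c), n)]
    # c cyclic shifts cover every (link, channel) pair at least once
    cover = [tuple((i + s) % c for i in range(n)) for s in range(c)]

    def mean_reward(M):
        return sum(theta[i][M[i]] for i in range(n))

    mu_star = max(mean_reward(M) for M in configs)
    counts = [[0] * c for _ in range(n)]   # T_ij(t)
    sums = [[0.0] * c for _ in range(n)]   # cumulative reward of each pair
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= len(cover):
            M = cover[t - 1]               # initialization phase
        else:
            def index(i, j):
                s = counts[i][j]
                # empirical mean plus assumed confidence term sqrt(a log(t) / s)
                return sums[i][j] / s + math.sqrt(a * math.log(t) / s)
            M = max(configs, key=lambda cfg: sum(index(i, cfg[i]) for i in range(n)))
        for i in range(n):                 # detailed feedback: one reward per active pair
            j = M[i]
            counts[i][j] += 1
            sums[i][j] += 1.0 if rng.random() < theta[i][j] else 0.0
        regret += mu_star - mean_reward(M)
    return regret, counts

regret, counts = run_ucb([[0.9, 0.1], [0.2, 0.8]], horizon=2000)
```

On this two-link example the sub-optimal matching should be selected only a logarithmic number of times, so most of the pulls in `counts` concentrate on the pairs of the optimal matching.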



4.1.3 Regret of $\epsilon$-greedy algorithms

The $\epsilon$-greedy algorithm consists in selecting the configuration that has provided the maximum reward
so far with probability $1 - \epsilon_t$, and a configuration selected uniformly at random among the covering set
$\mathcal{A}$ of configurations with probability $\epsilon_t$. By reducing the exploration rate $\epsilon_t$ over time, a logarithmic
regret can be achieved. More precisely, we will choose $\epsilon_t = \min(1, d/t)$ for some constant $d > 0$.
Define $\hat r(t) = (\hat r_{ij,T_{ij}(t)}, i \in [n], j \in [c])$.

Algorithm 2: $\epsilon$-greedy Algorithm

Initialization: For $t = 1, \ldots, A$, select configurations in $\mathcal{A}$, observe the detailed rewards, and update $\hat r(t)$.

for all $t > A$ do
    Let $\epsilon_t = \min(1, d/t)$.
    Select configuration $M(t) \in \arg\max_{M \in \mathcal{M}} M \bullet \hat r(t)$ with probability $1 - \epsilon_t$, and a configuration
    uniformly selected at random in $\mathcal{A}$ with probability $\epsilon_t$.
    Observe the detailed rewards and update $\hat r(t + 1)$.
end

Theorem 4 There exists a choice of parameter $d > 10 A n^2 / \Delta_{\min}^2$ such that we have:

$$R^{\epsilon\text{-greedy}}(T) \le 10 A \frac{\Delta_{\max}}{\Delta_{\min}^2} n^2 \log(T) + O(1), \quad \text{as } T \to \infty.$$

For the case of full interference, the bound on the regret can be improved to

$$R^{\epsilon\text{-greedy}}(T) \le 10 A \frac{\Delta_{\max}}{\Delta_{\min}^2} \min(n, c)^2 \log(T) + O(1).$$

Also notice that for this case, we can select $\mathcal{A}$ with $A = \max(n, c)$. As a result, for the full interference
case with this choice of $\mathcal{A}$, the regret scales as $\max(n, c) \min(n, c)^2 \log(T)$ when $T$ grows large.

Compared to the regret bound of UCB, a factor has been removed. The upper bound proposed in the 
above theorem is the best bound derived so far, even for the full interference case. Note however, that 
this does not imply that e-greedy outperforms UCB algorithm, as we cannot compare algorithms by just 
looking at their respective regret upper bounds. 

For the general interference graph, we can select $\mathcal{A}$ with $A = \max(c, b)$ where $b$ is the minimum
number of channels required to obtain a feasible channel assignment (with respect to the constraints of (2))
in which every node $i \in [n]$ is assigned a channel. A feasible channel assignment over the
conflict graph $G$ is equivalent to a $b$-coloring of $G$, i.e., a labeling of the vertices of $G$ with $b$
colors such that no two vertices sharing an edge have the same color. As a result, we can select
$b = \gamma(G)$, where $\gamma(G)$ is the chromatic number of the conflict graph $G$, and finally $A = \max(c, \gamma(G))$. This
also confirms the choice of $A = \max(c, n)$ for the full interference case, as for a complete graph $G$, $\gamma(G) = n$.
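
The $\epsilon$-greedy policy of this section can be simulated in the same toy setting. This is a minimal sketch under stated assumptions, not the authors' implementation: full interference with $n \le c$, Bernoulli rewards, a covering set made of cyclic shifts, and an arbitrary illustrative value of the constant $d$ (the value required by Theorem 4 depends on unknown gaps). The name `run_epsilon_greedy` is hypothetical.

```python
import random
from itertools import permutations

def run_epsilon_greedy(theta, horizon, d=20.0, seed=0):
    """Simulate epsilon-greedy with detailed feedback; returns (regret, plays of M*)."""
    rng = random.Random(seed)
    n, c = len(theta), len(theta[0])
    configs = [tuple(p) for p in permutations(range(c), n)]            # all matchings
    cover = [tuple((i + s) % c for i in range(n)) for s in range(c)]   # covering set

    def mean_reward(M):
        return sum(theta[i][M[i]] for i in range(n))

    best = max(configs, key=mean_reward)
    counts = [[0] * c for _ in range(n)]
    sums = [[0.0] * c for _ in range(n)]
    regret, plays_of_best = 0.0, 0
    for t in range(1, horizon + 1):
        eps = min(1.0, d / t)                      # decreasing exploration rate
        if t <= len(cover):
            M = cover[t - 1]                       # initialization phase
        elif rng.random() < eps:
            M = rng.choice(cover)                  # explore within the covering set
        else:
            # exploit: maximize the sum of empirical means of the selected pairs
            M = max(configs,
                    key=lambda cfg: sum(sums[i][cfg[i]] / counts[i][cfg[i]]
                                        for i in range(n)))
        for i in range(n):
            j = M[i]
            counts[i][j] += 1
            sums[i][j] += 1.0 if rng.random() < theta[i][j] else 0.0
        regret += mean_reward(best) - mean_reward(M)
        plays_of_best += (M == best)
    return regret, plays_of_best

regret, plays_of_best = run_epsilon_greedy([[0.9, 0.1], [0.2, 0.8]], horizon=2000)
```

Exploring only within the covering set keeps the exploration cost proportional to $A$ rather than to the (exponential) number of configurations, which is the design choice the regret bound reflects.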
4.2 Aggregate feedback

We now briefly discuss the case where aggregate feedback only is available. We derive an asymptotic
lower bound for regret in this scenario, but leave for future work the design of sequential spectrum allocation
strategies.

For any $M \in \mathcal{M}$, we define $A_M = \{(i, j) \in [n] \times [c] : M_{ij} = 1\}$. Further introduce for all
$k = 0, 1, \ldots, n$:

$$p(k; M, \theta) = \sum_{A \subset A_M, |A| = k} \; \prod_{(i,j) \in A} \theta_{ij} \prod_{(i,j) \in A_M \setminus A} (1 - \theta_{ij}), \qquad (6)$$

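and

$$\mathrm{KL}_2^M(\theta, \lambda) = \sum_{k=0}^{n} \log\frac{p(k; M, \theta)}{p(k; M, \lambda)} \, p(k; M, \theta).$$

In the following theorem, we derive an asymptotic regret lower bound. This bound is different from
that derived in Theorem 1 due to the different nature of the feedback considered. Comparing the two
bounds may indicate the price to pay by restricting the set of spectrum allocation policies to those based
on aggregate feedback only.

Theorem 5 For all $\theta \in \Theta$, for all uniformly good policy $\pi \in \Pi$,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge C_2(\theta), \qquad (7)$$

where $C_2(\theta)$ solves the following optimization problem:

$$C_2(\theta) = \inf_{x_M \ge 0, M \in \mathcal{M}} \sum_{M \in \mathcal{M}} x_M (\mu^* - \mu^M(\theta)), \qquad (8)$$

$$\text{s.t.} \quad \inf_{\lambda \in B_M(\theta)} \sum_{Q \ne M^*} x_Q \mathrm{KL}_2^Q(\theta, \lambda) \ge 1, \quad \forall M \ne M^*. \qquad (9)$$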



5 Adversarial Bandit Problem 

In this section, we study the problem in the adversarial setting. For classical adversarial bandit problems,
a regret bound of $O(\sqrt{T})$ is known, where the constant scales as the square root of the number of arms
(up to logarithmic factors) and linearly with the reward of a maximal allocation. In our case, the number of arms typically
grows exponentially with $n$ even in simple cases. For example, in the full interference case, the number
of possible allocations is the number of matchings in the complete bipartite graph $([n], [c])$, i.e. $\frac{n!}{(n-c)!}$ if
$n \ge c$. Also, in our case since $r_{ij}(t) \in [0, 1]$, the maximal reward of an allocation is of the order $\min(n, c)$.
In the sequel, using the structure of our problem, we derive an algorithm with the same dependence in
time but with much lower constants, scaling as $n\sqrt{c \log c}$.
We start with some observations about the ILP problem (2):

$$\max_{M \in \mathcal{M}} r \bullet M = \max_{p(M) \ge 0, \sum_{M} p(M) = 1} \; \sum_{M \in \mathcal{M}} p(M) \, r \bullet M = \max_{\mu \in Co(\mathcal{M})} r \bullet \mu,$$

where $Co(\mathcal{M})$ is the convex hull of the feasible allocation matrices $\mathcal{M}$.

We identify matrices in $\mathbb{R}^{n \times c}$ with vectors in $\mathbb{R}^{nc}$. Without loss of generality, we can always assume
that $c$ is sufficiently large (possibly adding artificial channels with zero reward) such that for all $i \in [n]$,
$\sum_{j \in [c]} M_{ij} = 1$ for all $M \in \mathcal{M}$, i.e. all links are allocated to a (possibly artificial) channel. Indeed, this
can be done as soon as $c \ge \gamma(G)$ where $\gamma(G)$ is the chromatic number of the interference graph $G$. In
other words, the bounds derived below are valid with $c$ replaced by the maximum between the number
of channels and the chromatic number of the interference graph. With this simplifying assumption, we
can embed $\mathcal{M}$ in the simplex of distributions in $\mathbb{R}^{nc}$ by scaling all the entries by $1/n$. Let $\mathcal{P}$ be this scaled
version of $Co(\mathcal{M})$.

We also define the matrix $\mu^0$ in $\mathbb{R}^{n \times c}$ with coefficients $\mu^0_{ij} = \frac{1}{n|\mathcal{M}|} \sum_{M \in \mathcal{M}} M_{ij}$. Clearly $\mu^0 \in \mathcal{P}$. We
define $\mu_{\min} = \min_{ij} n\mu^0_{ij} > 0$. Our algorithms are inspired by algorithms designed for the setting
where full information is revealed, and use the projection onto convex sets using the KL divergence
(see Chapter 3, I-projections in [7]).
We denote the KL divergence between distributions $q$ and $p$ in $\mathcal{P}$ (or more generally in the simplex of
distributions in $\mathbb{R}^{nc}$) by:

$$\mathrm{KL}(q\|p) = \sum_{e} q(e) \log\frac{q(e)}{p(e)},$$

where $e$ ranges over the couples $(i, j) \in [n] \times [c]$ and with the usual convention where $p \log\frac{p}{q}$ is defined to
be 0 if $p = 0$ and $+\infty$ if $p > q = 0$. By definition, the projection of a distribution $q$ onto a closed convex
set $S$ of distributions is the $p^* \in S$ such that

$$\mathrm{KL}(p^*\|q) = \min_{p \in S} \mathrm{KL}(p\|q). \qquad (10)$$
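
The conventions attached to this definition are easy to get wrong in code, so here is a small sketch of the divergence between two distributions given as flat lists over the couples $e$. The function name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) = sum_e q(e) log(q(e) / p(e)) with the usual conventions."""
    total = 0.0
    for qe, pe in zip(q, p):
        if qe == 0.0:
            continue            # a term with q(e) = 0 contributes 0
        if pe == 0.0:
            return math.inf     # q(e) > 0 while p(e) = 0
        total += qe * math.log(qe / pe)
    return total

uniform = [0.5, 0.5]
point_mass = [1.0, 0.0]
```

The divergence is not symmetric: `kl_divergence(point_mass, uniform)` is finite while `kl_divergence(uniform, point_mass)` is infinite, which is why the order of the arguments in the projection (10) matters.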

5.1 Detailed feedback 

We first present our algorithm in the case of detailed feedback. 

Algorithm 3: ColorBand-1 Algorithm

Initialization: Start with the distribution $q_0 = \mu^0$.
for all $t \ge 1$ do
    Let $p_{t-1} = (1 - \gamma) q_{t-1} + \gamma \mu^0$ ($p_{t-1} \in \mathcal{P}$ so that $n p_{t-1} \in Co(\mathcal{M})$).
    Select a random allocation $M(t)$ with distribution $n p_{t-1}$.
    Get a reward $r_t = \sum_{i,j} r_{ij}(t) M_{ij}(t)$ and observe the reward vector: $r_{ij}(t)$ for all $ij$ such that
    $M_{ij}(t) = 1$.
    Construct the reward matrix: $\tilde r_{ij}(t) = \frac{r_{ij}(t)}{n p_{t-1}(ij)}$ for all $i, j$ with $M_{ij}(t) = 1$ and all other entries are 0.
    Update $\tilde q_t(ij) \propto q_{t-1}(ij) \exp(\eta \tilde r_{ij}(t))$.
    Set $q_t$ to be the projection of $\tilde q_t$ onto the set $\mathcal{P}$ using the KL divergence.
end
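
One round of the update can be sketched in the special case of full interference with $n = c$, where $Co(\mathcal{M})$ is the set of doubly stochastic matrices and $\mathcal{P}$ is the set of nonnegative matrices whose rows and columns all sum to $1/n$. The sketch below is an illustration under these assumptions, not the authors' implementation: the KL projection is approximated by alternating row and column rescaling (iterative proportional fitting), $\mu^0$ is taken uniform, and the sampling of $M(t)$ with mean $n p_{t-1}$ is left out. All function names are hypothetical.

```python
import math

def scaling_projection(q, iters=200):
    """Approximate KL projection of a positive n x n matrix onto the matrices
    whose rows and columns all sum to 1/n, by alternating rescaling."""
    n = len(q)
    target = 1.0 / n
    p = [row[:] for row in q]
    for _ in range(iters):
        for i in range(n):                          # rescale each row
            s = sum(p[i])
            p[i] = [x * target / s for x in p[i]]
        for j in range(n):                          # rescale each column
            s = sum(p[i][j] for i in range(n))
            for i in range(n):
                p[i][j] *= target / s
    return p

def colorband1_update(q_prev, M, rewards, eta, gamma):
    """Return q_t from q_{t-1}, the played matching M (M[i] = channel of link i)
    and the observed detailed rewards (rewards[i] = reward of link i)."""
    n = len(q_prev)
    mu0 = 1.0 / (n * n)                             # uniform mu^0 when n = c
    p_prev = [[(1 - gamma) * q_prev[i][j] + gamma * mu0 for j in range(n)]
              for i in range(n)]
    q_tilde = [row[:] for row in q_prev]
    for i in range(n):
        j = M[i]
        r_est = rewards[i] / (n * p_prev[i][j])     # importance-weighted estimate
        q_tilde[i][j] *= math.exp(eta * r_est)      # multiplicative update
    return scaling_projection(q_tilde)              # normalizes and projects

q0 = [[0.25, 0.25], [0.25, 0.25]]
q1 = colorband1_update(q0, M=(0, 1), rewards=[1.0, 1.0], eta=0.1, gamma=0.1)
```

After the update, the mass of the pairs that were played and rewarded increases, while the row and column sums of `q1` stay equal to $1/n$.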



Theorem 6 We have 

Note that in the full interference case, we have MmL ~ min(c, n). 
Proof. We first prove the following result: 
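Lemma 2 For $\eta$ such that $\eta \tilde r_e(t) \le 1$ for all $e$ and $t$, and any $q \in \mathcal{P}$, we have

$$\sum_{t=1}^{T} q \bullet \tilde r(t) - \sum_{t=1}^{T} q_{t-1} \bullet \tilde r(t) \le \eta \sum_{t=1}^{T} q_{t-1} \bullet \tilde r^2(t) + \frac{\mathrm{KL}(q\|q_0)}{\eta},$$

where $\tilde r^2(t)$ is the vector that is the coordinate-wise square of $\tilde r(t)$.
Proof: We have

$$\mathrm{KL}(q\|\tilde q_t) - \mathrm{KL}(q\|q_{t-1}) = \sum_{e} q(e) \log\frac{q_{t-1}(e)}{\tilde q_t(e)} = -\eta \sum_{e} q(e) \tilde r_e(t) + \log Z_t,$$

with

$$\log Z_t = \log \sum_{e} q_{t-1}(e) \exp(\eta \tilde r_e(t))$$

$$\le \log \sum_{e} q_{t-1}(e) \big(1 + \eta \tilde r_e(t) + \eta^2 \tilde r_e^2(t)\big)$$

$$\le \eta \, q_{t-1} \bullet \tilde r(t) + \eta^2 \sum_{e} q_{t-1}(e) \tilde r_e^2(t),$$

where we used $e^{\eta x} \le 1 + \eta x + \eta^2 x^2$ for $\eta x \le 1$ in the first inequality and $\log(1 + x) \le x$ for all $x > -1$ in
the second inequality.
Hence, we have

$$\mathrm{KL}(q\|\tilde q_t) - \mathrm{KL}(q\|q_{t-1}) \le -\eta \, q \bullet \tilde r(t) + \eta \, q_{t-1} \bullet \tilde r(t) + \eta^2 q_{t-1} \bullet \tilde r^2(t).$$

Generalized Pythagorean inequality (see Theorem 3.1 in [7]) gives

$$\mathrm{KL}(q\|q_t) + \mathrm{KL}(q_t\|\tilde q_t) \le \mathrm{KL}(q\|\tilde q_t).$$

Since $\mathrm{KL}(q_t\|\tilde q_t) \ge 0$, we get

$$\mathrm{KL}(q\|q_t) - \mathrm{KL}(q\|q_{t-1}) \le -\eta \, q \bullet \tilde r(t) + \eta \, q_{t-1} \bullet \tilde r(t) + \eta^2 q_{t-1} \bullet \tilde r^2(t).$$

Summing over $t$ gives

$$\sum_{t=1}^{T} \big(q \bullet \tilde r(t) - q_{t-1} \bullet \tilde r(t)\big) \le \eta \sum_{t=1}^{T} q_{t-1} \bullet \tilde r^2(t) + \frac{\mathrm{KL}(q\|q_0)}{\eta}.$$

■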

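Let $E_t$ be the expectation conditioned on all the randomness chosen by the algorithm up to time $t$. By
construction, we have $E_t[\tilde r_{ij}(t)] = r_{ij}(t)$, hence by linearity of expectation, we have $E_t[q \bullet \tilde r(t)] = q \bullet r(t)$
and $E_t[q_{t-1} \bullet \tilde r(t)] = q_{t-1} \bullet r(t)$. Moreover, we have

$$E_t[q_{t-1} \bullet \tilde r^2(t)] = \sum_{i \in [n]} \sum_{j \in [c]} q_{t-1}(ij) \frac{r_{ij}^2(t)}{(n p_{t-1}(ij))^2} \, n p_{t-1}(ij) = \sum_{i \in [n]} \sum_{j \in [c]} \frac{q_{t-1}(ij) \, r_{ij}^2(t)}{n p_{t-1}(ij)} \le \frac{c}{1 - \gamma},$$

since $r_{ij}(t) \le 1$ and $\frac{q_{t-1}(ij)}{p_{t-1}(ij)} \le \frac{1}{1 - \gamma}$.

Using Lemma 2 and the above bound, we get with $n q^*$ the optimal allocation, i.e. $q^*(e) = \frac{1}{n}$ iff $M^*_e = 1$,

$$R^{\mathrm{ColorBand-1}}(T) = E\Big[\sum_{t=1}^{T} n q^* \bullet \tilde r(t) - \sum_{t=1}^{T} n p_{t-1} \bullet \tilde r(t)\Big] \le \frac{\eta n c T}{1 - \gamma} + \frac{n \log \mu_{\min}^{-1}}{\eta} + n \gamma T,$$

since $q_{t-1} \bullet r(t) - p_{t-1} \bullet r(t) \le \gamma$ and

$$\mathrm{KL}(q^*\|q_0) = -\frac{1}{n} \sum_{e \in M^*} \log(n \mu^0_e) \le \log \mu_{\min}^{-1}.$$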

The proof is completed by choosing $\eta$ and $\gamma$ appropriately, so that they satisfy the necessary
technical conditions. □
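
The key step of the proof, $E_t[\tilde r_{ij}(t)] = r_{ij}(t)$, can be checked exactly on a small example by enumerating a distribution over matchings. The sketch below is an illustration only: it assumes $n = c$, takes an arbitrary distribution `probs` over the perfect matchings with every pair having positive probability, and uses the fact that $n p_{t-1}(ij)$ is the probability that pair $(i, j)$ is selected. The name `expected_estimate` is hypothetical.

```python
from itertools import permutations

def expected_estimate(n, probs, r):
    """Exact expectation of the importance-weighted reward matrix.

    probs maps each matching (tuple giving the channel of each link) to its
    probability; r[i][j] is the reward of pair (i, j) in the current slot.
    """
    # marginal[i][j] = P(M_ij(t) = 1), which plays the role of n * p_{t-1}(ij)
    marginal = [[0.0] * n for _ in range(n)]
    for M, w in probs.items():
        for i in range(n):
            marginal[i][M[i]] += w
    estimate = [[0.0] * n for _ in range(n)]
    for M, w in probs.items():
        for i in range(n):
            j = M[i]
            # pair (i, j) is observed only when selected, and is then reweighted
            estimate[i][j] += w * r[i][j] / marginal[i][j]
    return estimate

probs = dict(zip(permutations(range(3)), [0.3, 0.2, 0.1, 0.1, 0.2, 0.1]))
r = [[0.2, 0.5, 0.9], [0.4, 0.1, 0.7], [0.6, 0.8, 0.3]]
estimate = expected_estimate(3, probs, r)
```

Up to floating-point rounding, `estimate` coincides with `r`, which is the unbiasedness property used above.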

5.2 Aggregate feedback 

We now adapt our algorithm to deal with aggregate feedback. 

Algorithm 4: ColorBand-2 Algorithm
Only steps 5 and 7 of ColorBand-1 algorithm are modified as follows:
5'. Get (and observe) a reward $r_t = \sum_{i,j} r_{ij}(t) M_{ij}(t)$.

7'. Let $\Sigma_{t-1} = E[M M^T]$ where $M$ has law $n p_{t-1}$. Set $\tilde r(t) = r_t \Sigma_{t-1}^{+} M(t)$, where $\Sigma_{t-1}^{+}$ is the
pseudo-inverse of $\Sigma_{t-1}$.



Theorem 7 We have 
Proof. We first prove a simple result: 

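Lemma 3 For all $x \in \mathbb{R}^{n\times c}$, we have $\Sigma_{t-1}^{+}\Sigma_{t-1}x = \bar x$, where $\bar x$ is the orthogonal projection of $x$ onto $\mathrm{span}(\mathcal{M})$, the linear space spanned by $\mathcal{M}$.

Proof: Note that for all $y \in \mathbb{R}^{n\times c}$, if $\Sigma_{t-1}y = 0$, then we have

$$y^T\Sigma_{t-1}y = E[y^T M M^T y] = E[(y^T M)^2] = 0, \qquad (11)$$

where $M$ has law $np_{t-1} = (1-\gamma)nq_{t-1} + \gamma n\mu^0$. By definition of $\mu^0$, each $M \in \mathcal{M}$ has a positive probability, so that by (11) $y^T M = 0$ for all $M \in \mathcal{M}$. In particular, we see that the linear application $\Sigma_{t-1}$ restricted to $\mathrm{span}(\mathcal{M})$ is invertible and is zero on $\mathrm{span}(\mathcal{M})^{\perp}$, hence we have $\Sigma_{t-1}^{+}\Sigma_{t-1}x = \bar x$.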
Lemma 4 We have



KL{q\\qo) 



el 



Proof: The proof is the same as for Lemma 2, but in the upper bound for $\log Z_t$ we now use $e^{\eta x} \le 1 + (e^{\eta} - 1)x$, valid for all $x \in [0,1]$; the rest of the proof follows directly. ■

We have

$$E_t[\tilde r(t)] = E_t\left[r_t\,\Sigma_{t-1}^{+}M(t)\right] = E_t\left[\Sigma_{t-1}^{+}M(t)M(t)^T r(t)\right] = \bar r(t),$$

where the last equality follows from Lemma 3 and $\bar r(t)$ is the orthogonal projection of $r(t)$ onto $\mathrm{span}(\mathcal{M})$. In particular, for any $np \in Co(\mathcal{M})$, we have

$$E[np\cdot\tilde r(t)] = np\cdot\bar r(t) = np\cdot r(t).$$



To simphfy notation, we denote Vt — yiJ2t=i ''(0) ^'^'^ — J2t=i 



„ColorBand—2 



by the algorithm ColorBand-2. Hence taking expectation in Lemma |4] we get 



^(i) the reward obtained 



E 



>(i-,)|^E[v.i-^l|ii|-„.r, 



this gives 



E 



Vt - Vt 



< 1 



(1 - 7)^/ 
e '' - 1 



E[Vt 



< 1 



e/1 - 1 

(1 - 
e '' - 1 



7 I nT 



e'' - 1 



Take t] = ^ = 



lo ~^ 



□ 



5.3 Implementation 

There is a specific case where our algorithm can be implemented efficiently: when the convex hull $Co(\mathcal{M})$ can be described by a number of constraints that is polynomial in $n$. Note that this cannot be ensured unless restrictive assumptions are made on the interference graph $G$, since there are up to $3^{n/3}$ maximal cliques in a graph with $n$ vertices [15]. There are families of graphs in which the number of cliques is polynomially bounded. These families include chordal graphs, complete graphs, triangle-free graphs, interval graphs, and planar graphs. Note however that a limited number of cliques does not ensure a priori that $Co(\mathcal{M})$ can be captured by a limited number of constraints. To the best of our knowledge, this problem is open and only particular cases have been solved, as for the stable set polytope (corresponding to the case $c = 2$, $r_{i1} = 1$ and $r_{i2} = 0$ with our notation) [18].
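The $3^{n/3}$ figure can be observed on small instances. The sketch below enumerates maximal cliques with a plain Bron-Kerbosch recursion on the complete multipartite graph with parts of size 3, a standard extremal example for this bound; it is an illustration only, and the function names are hypothetical rather than taken from the paper.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting).

    adj maps each vertex to the set of its neighbours.
    """
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

def complete_tripartite_blocks(n):
    """Complete multipartite graph on n vertices with parts of size 3 (n a multiple of 3)."""
    return {v: {u for u in range(n) if u // 3 != v // 3} for v in range(n)}

# Number of maximal cliques for n = 3, 6, 9: one vertex is chosen in each part.
counts = {n: len(maximal_cliques(complete_tripartite_blocks(n))) for n in (3, 6, 9)}
```

On these graphs a maximal clique picks exactly one vertex per part, which is why the count grows as $3^{n/3}$ and why clique-based descriptions of $Co(\mathcal{M})$ can become exponentially large.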
We consider the case where

$$Co(\mathcal{M}) = \Big\{ M : \forall i,\ \sum_{j\in[c]} M_{ij} \le 1;\ \ \forall \ell, j,\ \sum_{i\in K_\ell} M_{ij} \le 1 \Big\}. \qquad (12)$$

Note that in the special case where $G$ is the complete graph, we have such a representation, as in this case we have

$$Co(\mathcal{M}) = \Big\{ M : \sum_{j\in[c]} M_{ij} \le 1,\ \forall i;\ \ \sum_{i\in[n]} M_{ij} \le 1,\ \forall j \Big\}.$$

We now give an algorithm for step 6 of the algorithm, i.e. the projection onto $\mathcal{P}$. Since $\mathcal{P}$ is a scaled version of $Co(\mathcal{M})$, we give an algorithm for the projection onto $Co(\mathcal{M})$ given by (12).
Set $\lambda_i(0) = \mu_j(0) = 0$ for all $i, j$ and then define for $t \ge 0$,

Vi e [n], Kit + 1) = - log M,,e-^^(*) j (13) 
Vj e [c], fij{t + 1) - - maxlog KaM,,e'^''-'+'^^ . (14) 

We can show that 

Proposition 1 Let M*- = limt_j.oo A%e~'^**^*^~^J*^*\ Then M* is the projection of M onto Co{M) using 
the KL divergence. 



Although this algorithm is shown to converge, we must stress that the step (14) might be expensive as the number of distinct values of $\ell$ might be exponential in $n$. Again in the case of full interference, this step is easy and our algorithm reduces to Sinkhorn's algorithm (see [10] for a discussion).
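For readers unfamiliar with Sinkhorn's algorithm, here is a minimal sketch of its classical form: alternating row and column rescaling of a positive square matrix until all row and column sums equal 1. This is the textbook equality-constrained version, given as an assumption-laden illustration; the iteration (13)-(14) of the paper handles inequality constraints and clique sums, which this sketch does not attempt to reproduce.

```python
def sinkhorn(matrix, iterations=500):
    """Alternately normalise rows and columns of a positive square matrix.

    Returns a matrix whose row sums and column sums are (numerically) all 1.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: divide every row by its sum.
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column step: divide every column by its sum.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

M0 = [[1.0, 2.0, 0.5],
      [0.3, 1.0, 4.0],
      [2.0, 0.1, 1.0]]
M_star = sinkhorn(M0)
row_sums = [sum(row) for row in M_star]
col_sums = [sum(M_star[i][j] for i in range(3)) for j in range(3)]
```

Each half-step is a KL projection onto one family of linear constraints, which is the mechanism the proof of Proposition 1 relies on.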

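Proof: First note that the definition of projection can be extended to non-negative vectors thanks to (10). More precisely, given an alphabet $A$ and a vector $q \in \mathbb{R}_+^{A}$, we have for any probability vector $p \in \mathbb{R}_+^{A}$,

$$\sum_{a} p(a)\log\frac{p(a)}{q(a)} \ge \Big(\sum_a p(a)\Big)\log\frac{\sum_a p(a)}{\sum_a q(a)} = \log\frac{1}{\|q\|_1},$$

thanks to the log-sum inequality. Hence we see that $p^\star(a) = q(a)/\|q\|_1$ is the projection of $q$ onto the simplex of $\mathbb{R}_+^{A}$.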
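The claim that normalising $q$ gives its KL projection onto the simplex can be checked numerically. The sketch below is illustrative only (the function names are assumptions): it compares the divergence at $p^\star$ with the lower bound $\log(1/\|q\|_1)$ and with randomly drawn probability vectors.

```python
import math
import random

def divergence(p, q):
    """sum_a p(a) log(p(a)/q(a)) for a probability vector p and a positive vector q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def project_to_simplex(q):
    """Candidate KL projection of a positive vector: divide by its l1 norm."""
    total = sum(q)
    return [v / total for v in q]

random.seed(1)
q = [random.uniform(0.1, 3.0) for _ in range(5)]
p_star = project_to_simplex(q)
lower_bound = math.log(1.0 / sum(q))

# Gap between the divergence of random probability vectors and the lower bound.
gaps = []
for _ in range(1000):
    x = [random.random() + 1e-6 for _ in q]
    total = sum(x)
    p = [v / total for v in x]
    gaps.append(divergence(p, q) - lower_bound)

gap_at_projection = divergence(p_star, q) - lower_bound
```

The gap is zero (up to rounding) at `p_star` and non-negative for every sampled `p`, in line with the log-sum inequality.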

Now define $A_i = \{M : \sum_{j} M_{ij} \le 1\}$ and $B_{\ell j} = \{M : \sum_{i\in K_\ell} M_{ij} \le 1\}$. Hence $\bigcap_i A_i \cap \bigcap_{\ell,j} B_{\ell j} = Co(\mathcal{M})$. By the argument described above, iteration (13) (resp. (14)) corresponds to the projection onto $A_i$ (resp. $\bigcap_\ell B_{\ell j}$) and the proposition follows from Theorem 5.1 in [7].



6 Conclusion 

In this paper, we investigate the problem of sequential spectrum allocation in wireless networks where a potentially large number of channels are available, and whose radio conditions are initially unknown. The design of such allocations has been mapped into a generic linear multi-armed bandit problem, for which we have devised efficient online algorithms. Regret upper bounds for these algorithms have been derived, and they are shown to improve on the corresponding bounds of existing algorithms, both in the stochastic setting where the radio conditions on the various channels and links are modelled as stationary processes, and in the adversarial setting where no assumptions are made regarding the evolution of channel qualities. The practical implementation of our algorithms has only been briefly discussed. In particular, designing efficient distributed implementations of these algorithms seems quite challenging, and we are currently working towards this objective.
A Proof of Theorems 1 and 5

To derive regret lower bounds, we apply the techniques used by Graves and Lai [8] to investigate efficient adaptive decision rules in controlled Markov chains. We recall here their general framework. Consider a controlled Markov chain $(X_t)_{t\ge 0}$ on a finite state space $S$ with a control set $U$. The transition probabilities given control $u$ are parameterized by $\theta$ taking values in a compact metric space $\Theta$: the probability to move from state $x$ to state $y$ given the control $u$ and the parameter $\theta$ is $p(x, y; u, \theta)$. The parameter $\theta$ is not known. The decision maker is provided with a finite set of stationary control laws $G = \{g_1, \ldots, g_K\}$ where each control law $g_j$ is a mapping from $S$ to $U$: when control law $g_j$ is applied in state $x$, the applied control is $u = g_j(x)$. It is assumed that if the decision maker always selects the same control law $g$, the Markov chain is then irreducible with stationary distribution $\pi_\theta^g$. Now the reward obtained when applying control $u$ leading to state $x$ is denoted by $r(x, u)$, so that the expected reward achieved under control law $g$ is: $\mu_\theta(g) = \sum_{x} r(x, g(x))\,\pi_\theta^g(x)$. There is an optimal control law given $\theta$, whose expected reward is denoted $\mu_\theta^\star = \max_{g\in G}\mu_\theta(g)$. Now the objective of the decision maker is to sequentially select control laws so as to maximize the expected reward up to a given time horizon $T$. As for MAB problems, the performance of a decision scheme can be quantified through the notion of regret, which compares the expected reward to that obtained by always applying the optimal control law.

Proof of Theorem 1. We now apply the above framework to our linear bandit problem. To simplify the presentation, we consider the case where in each slot, a single packet is transmitted. We will indicate what to modify when $m$ packets are transmitted per slot.

The parameter $\theta$ takes values in $[0,1]^{n\times c}$. The Markov chain has values in $S = \{0,1\}^n$. When $m$ packets are transmitted per slot, $S = \{0, 1/m, 2/m, \ldots, 1\}^n$. The set of controls corresponds to the set of feasible configurations $\mathcal{M}$, and the set of control laws is also $\mathcal{M}$. These laws are constant, in the sense that the control applied by control law $M$ does not depend on the state of the Markov chain, and corresponds to selecting configuration $M$. The transition probabilities are given as follows: for all $x, y \in S$,

$$p(x, y; M, \theta) = p(y; M, \theta) = \prod_{i\in[n]} p_i(y_i; M, \theta),$$

where for all $i \in [n]$, if $\sum_{j\in[c]} M_{ij} = 0$, $p_i(0; M, \theta) = 1$, and if $\sum_{j\in[c]} M_{ij} = M_{iM(i)} = 1$, $p_i(y_i; M, \theta) = \theta_{iM(i)}^{y_i}(1 - \theta_{iM(i)})^{1-y_i}$. When $m$ packets are sent per slot, the last formula has to be replaced by: $p_i(y_i; M, \theta) = \binom{m}{m y_i}\theta_{iM(i)}^{m y_i}(1-\theta_{iM(i)})^{m(1-y_i)}$. Finally, the reward $r(y, M)$ is defined by $r(y, M) = M \cdot y$.

Note that the state space of the Markov chain is here finite, and so we do not need to impose any cost associated with switching control laws (see the discussion on page 718 in [8]).
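The product form of the transition probabilities in the single-packet case can be sketched as follows. This is a hedged illustration, not the authors' code: a configuration is represented by a hypothetical list `assignment` giving the channel of each link (or `None` for an idle link), and the reward is taken to be the number of successful transmissions, which is an assumption about how $M \cdot y$ is evaluated.

```python
from itertools import product

def outcome_probability(y, assignment, theta):
    """Probability of the success vector y in {0,1}^n under a configuration.

    assignment[i] is the channel used by link i, or None if link i is idle;
    theta[i][j] is the success probability of link i on channel j.
    """
    prob = 1.0
    for i, y_i in enumerate(y):
        j = assignment[i]
        if j is None:
            prob *= 1.0 if y_i == 0 else 0.0   # an idle link never succeeds
        else:
            prob *= theta[i][j] if y_i == 1 else 1.0 - theta[i][j]
    return prob

def expected_reward(assignment, theta):
    """Expected number of successful transmissions (assumed reward)."""
    n = len(assignment)
    return sum(outcome_probability(y, assignment, theta) * sum(y)
               for y in product((0, 1), repeat=n))

theta = [[0.9, 0.2], [0.4, 0.7], [0.5, 0.5]]
assignment = [0, 1, None]          # link 0 on channel 0, link 1 on channel 1, link 2 idle
total_mass = sum(outcome_probability(y, assignment, theta)
                 for y in product((0, 1), repeat=3))
mean_reward = expected_reward(assignment, theta)
```

Because the law of the next state does not depend on the current state, the chain is memoryless, which is what makes the Graves and Lai framework applicable without switching costs.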
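We can now apply Theorem 1 in [8]. Note that the KL number under configuration $M$ is:

$$KL^M(\theta, \lambda) = \sum_{y\in S} p(y; M, \theta)\log\frac{p(y; M, \theta)}{p(y; M, \lambda)}.$$

For instance, when a single packet is sent per slot, we get:

$$KL^M(\theta, \lambda) = \sum_{i\in[n]}\sum_{j\in[c]} M_{ij}\,KL(\theta_{ij}, \lambda_{ij}),$$

where $KL(u, v) = u\log(u/v) + (1-u)\log((1-u)/(1-v))$. From Theorem 1 in [8], we conclude that for any uniformly good rule $\pi$,

$$\liminf_{T\to\infty}\frac{R^\pi(T)}{\log(T)} \ge C(\theta),$$

where $C(\theta)$ is the optimal value of the following optimization problem:

$$\inf_{x_M \ge 0,\ M\in\mathcal{M}}\ \sum_{M\in\mathcal{M}} x_M\left(\mu^\star - \mu^M(\theta)\right) \qquad (15)$$

$$\text{s.t.}\quad \inf_{\lambda\in B(\theta)}\ \sum_{M\ne M^\star} x_M\,KL^M(\theta, \lambda) \ge 1. \qquad (16)$$

The result is obtained by observing that $B(\theta) = \bigcup_{M\ne M^\star} B_M(\theta)$.

□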
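As a small illustration of the single-packet KL number, the sketch below evaluates $\sum_{i,j} M_{ij}\,KL(\theta_{ij}, \lambda_{ij})$ for a configuration stored as a list of channels. The representation and the names `bernoulli_kl` and `kl_number` are assumptions made for this example.

```python
import math

def bernoulli_kl(u, v):
    """KL(u, v) = u log(u/v) + (1-u) log((1-u)/(1-v)), with 0 log 0 = 0 and 0 < v < 1."""
    total = 0.0
    if u > 0:
        total += u * math.log(u / v)
    if u < 1:
        total += (1 - u) * math.log((1 - u) / (1 - v))
    return total

def kl_number(assignment, theta, lam):
    """KL number of a configuration: only the (link, channel) pairs it uses contribute."""
    return sum(bernoulli_kl(theta[i][j], lam[i][j])
               for i, j in enumerate(assignment) if j is not None)

theta = [[0.9, 0.2], [0.4, 0.7]]
assignment = [0, 1]                      # link 0 on channel 0, link 1 on channel 1

lam_unused = [row[:] for row in theta]
lam_unused[0][1] = 0.8                   # differs from theta on a pair the configuration does not use

lam_used = [row[:] for row in theta]
lam_used[0][0] = 0.5                     # differs from theta on a pair the configuration uses

kl_same = kl_number(assignment, theta, theta)
kl_unused = kl_number(assignment, theta, lam_unused)
kl_used = kl_number(assignment, theta, lam_used)
```

A configuration gathers no information about pairs it does not use (`kl_unused` is zero), which is why the constraint (16) forces some exploration of sub-optimal configurations.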

Proof of Theorem 5. In the case of aggregate feedback, when configuration $M$ is selected at time $t$, only the global reward $r_M(t)$ is known. To take this limited feedback into account, the state space of the corresponding Markov chain should record the global reward only. Hence, we have $S = \{0, 1, \ldots, n\}$. When the state is $k$, it means that the global received reward is equal to $k$. The probability that the reward under configuration $M$ is equal to $k$ is then $p(k; M, \theta)$ defined in (6), and so: for all $k', k \in S$,

$$p(k', k; M, \theta) = p(k; M, \theta).$$

Theorem 5 is then a direct consequence of Theorem 1 in [8]. □

B Proof of Theorem 2 

To simplify the presentation, we only consider the case where $n = c$. We first establish an upper bound on $C(\theta)$, and then derive a lower bound.

Recall that $C(\theta)$ solves the following optimization problem:

$$\inf_{x \ge 0}\ \sum_{M\in\mathcal{M}} x_M\left(\mu^\star - \mu^M(\theta)\right) \qquad (17)$$

$$\text{s.t.}\quad \inf_{\lambda\in B_M(\theta)}\ \sum_{Q\in\mathcal{M}} x_Q\,KL^Q(\theta, \lambda) \ge 1,\quad \forall M \ne M^\star. \qquad (18)$$

B.1 Upper Bound for $C(\theta)$

We define Cm as the set of iinfcs i sucli ttiat M{i) ^ M*{i) and denote by Lm its cardinaiity. To obtain 

/I) 

'M 



an upper bound for C(9), we first derive a lower bound for j'^^j{9),\/M ^ M* , defined as: 



^ inf Yl ^qKL'^{0,X) 



xeBMio) 

" '^^^r ^ XI ^^i^tM{i),\M{i)) y2 ^k, (19) 

where = {M' : {i, M{i)) e M'} is the set of configurations that share pair {i, M{i)) with M. Ob- 

serve that the cardinality of A/i(M) is (n— 1)!. Note that for two Bernoulli distributions with parameters 
p,qe (0, 1), we have KL{p, q) > 2{p - q)'^. Then 

JI^\o) > mf ^ 2(0,MW-Aafw)' E (^0) 

ieCM keAfi{M) 

s.t. ^ 

\M*(i) = OiM*(i)i e [n]. 
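The Bernoulli bound $KL(p, q) \ge 2(p - q)^2$ used to obtain (20) is Pinsker's inequality specialised to two-point distributions. A quick numerical check on a grid, given here purely as an illustration:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

grid = [k / 100 for k in range(1, 100)]

# Smallest value of KL(p, q) - 2 (p - q)^2 over the grid; it should never be negative.
min_gap = min(bernoulli_kl(p, q) - 2 * (p - q) ** 2 for p in grid for q in grid)
```

The gap vanishes only on the diagonal $p = q$, so the quadratic relaxation is tight for nearby parameters.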
To derive a lower bound for J^P(0), we will use the following lemma, proved at the end of this section: 

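Lemma 5 Let $\mathcal{A} \subseteq [n]$ be some set with cardinality $A$, and let $a \in \mathbb{R}_{++}^{A}$, $p \in \mathbb{R}_{+}^{A}$ and $K$ be some constants such that $K \ge \sum_{i\in\mathcal{A}} p_i$. Define $Z^\star$ as

$$Z^\star = \inf\Big\{\sum_{i\in\mathcal{A}} 2a_i(p_i - z_i)^2 : \sum_{i\in\mathcal{A}} z_i \ge K,\ z \in [0,1]^{A}\Big\}. \qquad (21)$$

Then, $Z^\star \ge \dfrac{2}{\sum_{i\in\mathcal{A}} \frac{1}{a_i}}\Big(K - \sum_{i\in\mathcal{A}} p_i\Big)^2$.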



Applying the above lemma to problem (20), i.e., choosing for any i, Zi = XiM(i)i o,i — '^keAf {M) 
Pi = OiM(i), and K = Y^ieCu "^^ oEtain: 

Jm (^) ^ — 



T (A*0^. 



y 



Now define: 



We have: 



V{e) = L e M^l-' : ^ ^ 2(A*0', VM ^ M* 



V{d) C |a; e M^l"' : > 1, VM ^ Af*} 

Let ^^(61) = — : and define the set H{0) as: 

H(0) ^[xe M^I"\a;M > ii(0)} • 
Then we have $H(\theta) \subset V(\theta)$. Indeed, if $x \in H(\theta)$, for $M \ne M^\star$,

V ^- < < 2(A^)2. 



We conclude that: 



C(e) < inf V xmA'^ < inf V xmA 

xeV(0) ^ x&U{6) ^ 



M 



(n! — l)n 

MjiM* 

And thus: C{9) < O(n^) as rn- oo. 



, . • f ^max [nl - l)n 

xeH(e) ^ 2A^. (n-1)! 



B.2 Lower Bound for C{Q) 

To obtain the lower bound for (7(6*), we derive an upper bound for j'^-^(G)- For any ^ M* , define 
^ij = YlivK^c^- ) where /^(ij) = {M : (i, j) e M} is the set of configurations including Then, 
for M ^ M*, we have: 



Jm\^) = . ^iM(i)KL{6iM(i),\M(i))- (22) 
Let B'm{9) = {A G ^MCe*) : \iM{i) > dmii^^i e -Cm}- Then, 
^mW< inf V CiM(i)-^i(^'iM(i)>'^iM(i)) < inf V CiM(i)(l-^'iM(i))log:, ^MW.^ (23) 



where we used the fact that for two Bernoulli distributions with parameters p, g € (0, 1) with p < q, we 
have KL{p, q) < {1 — p) log To derive an upper bound for J^m\^)^ 

we will use the following result: 

Lemma 6 Let A C [n] be some set with cardinality A and a,p G and Q be some constants such that 
Y^ieAPi <Q<^- Define Y* as 

s.t. y^^zi> Q, 

ieA 

Zi > Pi, Vi e A. 

Then, Y* < ^^^^ EieA - Pi)- 
Applying the previous lemma where ^ = £m, = £.iM{i),Pi = 0,M{i)yi e £a/, and Q = I]jg£m ^iM*{i), 
we get: 

= J ^ i^~('^Mi^))^^M{i) 



< 



< 



< 



Lai - iM maxje£„ ^^^/^(i) 

A*^ max^s£„ (l - 6>,m(»)) 
Lm mmig£„ (1 - 6'iM*(i)j 
/3A„ 



where B = maxM^^Af * f j ^ j-^^j. have established 

\^mm.e£j,^(l-e.jv,*(,)jy 

^A/ .~r 

Now, define 

Then: |a; G rI^'"^ : j'^j\9) > 1, VM ^ Af*| C ^^(61), and hence: 

0(9) > inf A„,i„ V XM- (25) 

Without loss of generality, from now on, we assume that M* = {(?,«) : i £ [n]}- Define T = > 
1}. We have: 

n 

Xl^A/ =Cll+5I^lJ■ 
A/ j=2 

Now since > xm*' 

n 

Now, if 



(26) 



using (25) and (26), we get: 



C{9)> inf A„iin V a;A^> inf A„,inV6j. (27) 



^ ' M^M* ^ ^ ' 3=2 
Figure 1: Two configurations $M^{(2)}$ (panel a) and $M^{(4)}$ (panel b) for the case of $n = 5$ and $M^\star = \{(i,i) : i \in [n]\}$.



Next, we provide an explicit lower bound to the LP in the R.H.S. of (27). Consider $n-1$ configurations $M^{(k)}$, $k \in [n-1]$, where

$$M^{(k)} = \{(1, k+1), (2, 1), (3, 2), \ldots, (k+1, k), (k+2, k+2), \ldots, (n, n)\},$$

and verify that $T \subset \bigcup_{k\in[n-1]} M^{(k)}$. Two examples of configurations, $M^{(2)}$ and $M^{(4)}$ for $n = 5$, are shown in Figure 1.
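The configurations $M^{(k)}$ can be generated mechanically. The sketch below builds them as a mapping from links to channels (1-indexed, as in the text) and checks that each one is a valid assignment of distinct channels; the function name is hypothetical.

```python
def shifted_configuration(n, k):
    """Return M^(k) as a dict link -> channel, for 1 <= k <= n - 1.

    Link 1 takes channel k + 1, links 2..k+1 take channels 1..k,
    and the remaining links keep the channel with their own index.
    """
    config = {1: k + 1}
    for i in range(2, k + 2):
        config[i] = i - 1
    for i in range(k + 2, n + 1):
        config[i] = i
    return config

n = 5
configs = {k: shifted_configuration(n, k) for k in range(1, n)}

# Number of links whose channel differs from the assumed optimum M* = {(i, i)}.
moved_links = {k: sum(1 for i, j in cfg.items() if i != j) for k, cfg in configs.items()}
```

Each $M^{(k)}$ is a cyclic shift on the first $k+1$ links, so it differs from $M^\star$ on exactly $k+1$ links.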

To obtain a lower bound for X]J=2^ii ~ jjer ' ii^te that for ^ G 



62 > 

63 > 

^14 > 



max 

3 

max 

4 



/3A„ 



61 

C21 — C32 

C2I — ^32 — C43 



/3A„ 



(i,i)eM('=-i)\{M*u(i,fc)} 



6™ > 



(ij)eA/("-i)\{(i,«)} 



Summing these inequalities and taking the infimum over ^ G ■^(^), we get: 

32L E ^^^-^x^E^-^iif.) E 



«es(e) 



fc=2 



(ij)eA/("-i)\{(i,«)} 



where Wij, G M^" are strictly positive integers. We actually prove that: 



inf > Wii£ii ~ 0. 

'(ij)GA/("-i)\{(l,n)} 
For brevity, denote $Z = M^{(n-1)} \setminus \{(1, n)\}$. The above statement is equivalent to:

3eeS((?):V(z,j)eZ,e., =0. 

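In view of the definition of the feasible set, this is then equivalent to showing that:

$$\forall M \ne M^\star,\ \exists i \in \mathcal{L}_M : (i, M(i)) \notin Z,$$

which is easily verified. Indeed, let $i_0 = \min\{i : i \in \mathcal{L}_M\}$. Since all links $i < i_0$ keep channel $i$, we have $M(i_0) > i_0$, whereas every pair in $Z$ has the form $(i, i-1)$. Then $(i_0, M(i_0)) \notin Z$. Finally we have proved that: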

C(f?)> inf A,„i„ 

A . " A . 



This gives the required lower bound for C{6). 



□ 



B.3 Supporting lemmas 
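Proof of Lemma 5. First observe that

$$Z^\star \ge \min\Big\{\sum_{i\in\mathcal{A}} 2a_i(p_i - z_i)^2 : \sum_{i\in\mathcal{A}} z_i \ge K,\ z \in \mathbb{R}^{A}\Big\}.$$

The Lagrangian for this problem is given by

$$L(z, w) = \sum_{i\in\mathcal{A}} 2a_i(p_i - z_i)^2 + w\Big(K - \sum_{i\in\mathcal{A}} z_i\Big).$$

The minimizer of this problem $z^\star$ satisfies $\nabla_z L(z^\star, w) = 0$. Hence, $\frac{\partial L}{\partial z_i} = 4a_i(z_i^\star - p_i) - w = 0$, or equivalently $z_i^\star = p_i + \frac{w}{4a_i}$. So the dual problem is

$$\max_{w\ge 0}\min_{z\in\mathbb{R}^{A}} L(z, w) = \max_{w\ge 0} L(z^\star, w) = \max_{w\ge 0}\ \sum_{i\in\mathcal{A}}\frac{w^2}{8a_i} + w\Big(K - \sum_{i\in\mathcal{A}} p_i - \sum_{i\in\mathcal{A}}\frac{w}{4a_i}\Big) = \max_{w\ge 0}\ -\frac{w^2}{8}\sum_{i\in\mathcal{A}}\frac{1}{a_i} + w\Big(K - \sum_{i\in\mathcal{A}} p_i\Big).$$

The maximizer of the dual problem, $w^\star$, solves

$$\frac{\partial}{\partial w}L(z^\star, w) = -\frac{w}{4}\sum_{i\in\mathcal{A}}\frac{1}{a_i} + \Big(K - \sum_{i\in\mathcal{A}} p_i\Big) = 0.$$

Thus $w^\star = \dfrac{4}{\sum_{i\in\mathcal{A}}\frac{1}{a_i}}\Big(K - \sum_{i\in\mathcal{A}} p_i\Big)$ and finally

$$Z^\star \ge \frac{2}{\sum_{i\in\mathcal{A}}\frac{1}{a_i}}\Big(K - \sum_{i\in\mathcal{A}} p_i\Big)^2.$$

□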
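The closed form of Lemma 5 can be sanity-checked numerically. The sketch below, with hypothetical helper names and randomly drawn constants, evaluates the objective at the stationary point $z_i^\star = p_i + w^\star/(4a_i)$ and compares the bound with the objective at random feasible points of the box-constrained problem.

```python
import random

def lemma5_bound(a, p, K):
    """Closed-form lower bound 2 (K - sum p)^2 / sum(1/a)."""
    return 2.0 * (K - sum(p)) ** 2 / sum(1.0 / x for x in a)

def stationary_point(a, p, K):
    """z_i = p_i + w / (4 a_i) with the dual maximiser w = 4 (K - sum p) / sum(1/a)."""
    w = 4.0 * (K - sum(p)) / sum(1.0 / x for x in a)
    return [p_i + w / (4.0 * a_i) for a_i, p_i in zip(a, p)]

def objective(a, p, z):
    return sum(2.0 * a_i * (p_i - z_i) ** 2 for a_i, p_i, z_i in zip(a, p, z))

random.seed(2)
a = [random.uniform(0.5, 3.0) for _ in range(4)]
p = [random.uniform(0.1, 0.4) for _ in range(4)]
K = sum(p) + 0.3

z_star = stationary_point(a, p, K)
bound = lemma5_bound(a, p, K)

# Smallest objective value among random points of [0,1]^4 satisfying sum(z) >= K.
best_feasible = float("inf")
for _ in range(2000):
    z = [random.random() for _ in a]
    if sum(z) >= K:
        best_feasible = min(best_feasible, objective(a, p, z))
```

The stationary point meets the constraint with equality and attains the bound for the relaxed problem, while every feasible point of the original problem gives a value at least as large.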



Proof of Lemma [6| The partial Lagrangian for problem ( 24 ) is given by 

I -P. 

Ji^L - Pi J log 



The optimal solution to this problem z* satisfies W zL{z* ^w) — and z* e [pj,l],Vi G yl. Hence 

1 







dL 
dzi 



w, which yields z* 



1 



ai(l-P.) 



or equivalently z* = 1 — w > 



maxig_4 fli. Thus, for w > maxig_4 a^, the dual function is given by 
D{w) = inf L{z,w) 



= ^a,(l -K)log — + w Q - X! 



ieA 



ieA 



1 - 



= ^ a,(l - p,) log - + (g - + ^ a,(l - p,) 
"'^ leA 

and the dual problem is: maxtu>maxig^ a; D{w). Now, we establish an upper bound for K*, as follows 

_D(ti;) ^ niax_D(w) > max Z3(w) = Y* , 



where the last equality is due to strong duality for problem ( 24 ) . It can be easily shown that w — a-q 
and hence 



^ai(l -p,)log 



+ (Q-^)- 



ieA 



ieA 

Thus, we established 



a^{A-Q) 



ieA 



ieA 



\og{A~Q). 



(28) 



In order to further simplify the R.H.S. of ( 28 1, we first derive an inequality using the log-sum inequality. 



Based on this inequality, for non-negative vectors e, /, we have 

X! e» log ^ > ^ e^ log 
or equivalently, J2i log ^=7:^ > Yji log — f-^- Now, choosing = (1 - pi)ai and = 1 - pi,Vi G yl, 



we get 



2^(1 - log — — 

ieA ^ ieA 



Ed 

ieA 



Pijai 



>^(i-p.Kiog ^y^ 



log - E(l^ 

keA ieA 



Applying this result to the R.H.S. of p8| , we then provide an upper bound to Y*, as follows 

T^jeAi^-Pj)^] 



Y* <Y,il - p,)adog 



ieA 



ieA 



< 



^ai(l-pj) log^(l-pj)- ^a,(l-p,; 
ieA jeA ^ieA 



\og{A - Q) 
\og{A - Q) 



-log 



J2^eAi^-P^) 



A~Q 



ieA 



^;T,ieA0_P^_^)^j2a,il-p.) 



A-Q 

Q-EieAP^ 
A-Q 



ieA 



ieA 



where in last two inequalities we used the fact that logz < z — 1, for z > 0. This completes the proof. □ 



C Proof of Theorem 3 

The proof is along the same lines as in fll^ . We present the proof for completeness. Let T^' (t) denote 
the number of times spent on the suboptimal configuration M ^ M* . Then the regret of UCB algorithm 
at time t is given by 

For each suboptimal connection, or (link, channel) pair $(i, j)$, define $Z_{ij}(t)$ as follows. If a suboptimal configuration $M(t)$ is chosen at time $t$, $Z_{ij}(t)$ is increased by one for some $(i, j) \in \arg\min_{(i',j')\in M(t)} T_{i'j'}(t)$, where $T_{i'j'}(t)$ is the number of times pair $(i', j')$ has been selected up to time $t$. We let $I_{ij}(t)$ be the indicator function where $I_{ij}(t) = 1$ if $Z_{ij}(t)$ is incremented by 1 at time $t$. Therefore, when $I_{ij}(t) = 1$ there exists a configuration $M(t) \ne M^\star$ such that $(i, j) \in M(t)$, where the time index in $M(t)$ implies that we may get different arms at different times.

Let Z{t) = E^e[n]EJe[e]^u■(*)• We have Em^a/* E[T*^(t)] = E[Z(i)]. As a result, i?ucB(i) < 
AmaxEA/^A/* J2^e[n]^ZiM{i)it)]■ Define X,j{s) = r„-T,,(s) and 

Vj\/ — {i G [n] : Mij ~ 1 for some j G [c]} 



with cardinality Vm- Also recall that pt^s = \/oi \ogt/s and let u be a positive integer. Then, in (29) we 
provide an upper bound of Zij{t), where for brevity we introduce Mt = M{t) ^ M* with (i, j) e Mt- 
i I 

Zij{t)<l+ ^ l{iij{s) = l}<u+ J2 '^{lij{s) = l,Zij{s-l)>u} 

s=A+l s=«+l 
<U+ ^ i| ^ (XiM*ii){s-l) + Ps-l,T,M*Ci)is-l)) 

- {^iMs{i)is-^)+Ps-l,T,M,(i){s-l)) ^Zij{s-l)>u\ 



s=u+l I ieVm* 



s=u+l 



(0<s»<s),vieVM* . Tr («<Si<s),vieVM, . 



<u 



s=l (l<Si<s-l),VjeVM* («<s-<s-l),ViGVMs [ieVM* ieVMs 



(29) 



Now the event 

implies at least one of the following events 

= {riM*{i),Si < (^iM*{i) - Ps,Si} , ^I&Vm* 

^2,i — {^iMs(i),s[>6iMs{i)+Ps,s[}^ '^i^^M^ 



GiM*{i)< ^ ^iM,(i)+2 ^ Pa,s[ 

igVm* ieVMa ieVM, 



Using Chernoff-Hoeffding bound, we easily get that Pr(Si,i) < s ^",Vi S Vm* and Pr(S2,i) < s ^",Vi S 
Vms- To obtain a bound on the probability of B3, we use 



u > 



4an^ log t 



Prom s^>u and Vm' < n,\/M' e A^, we simply deduce that: 

diM*(i) — ^ ^iM,{i) — 2 ^ Ps,s^ > ^ OiM*(i) — ^ (^iM,{i) — Amin > 0. 

jeVm* ieVM, ieVMj jeVm* »eVMa 



Hence for u > 



4a n log t 



, we get 
where Ca = Etli(^M* + FMji^"*+^'^*"^". Noting that Vm' < n,\/M' e M, the condition a > n + 1/2 
guarantees that Ca < oo. Finally, 

^r^, 4an^clogi 

IE[2^(t)] < — ^2 — —+0{1), ast^oo, 

min 

which completes the proof. □ 

D Proof of Theorem 4 

We first provide a bound on the probability of choosing a suboptimal configuration $M$. In what follows, we denote $\hat X^M(t) = \sum_{i\in\mathcal{V}_M} \hat r_{iM(i),T_{iM(i)}(t)}$ and $\Delta^M = \mu^\star - \mu^M(\theta)$, where

$$\mathcal{V}_M = \{i \in [n] : M_{ij} = 1 \text{ for some } j \in [c]\}$$

with cardinality $V_M$.

For t > d, the probability of choosing a suboptimal configuration M can be written as 

Pr{7t =M}< |l{M e ^} + (1 - et)^, 

where £; = Pr {X'^{t - 1) > X^* - 1)} • Then 



< Pr< I] ^»M(i),T,„(.)(t-i) > ^ 
UgVm ieVM 



»M(i) 



2V, 



M 



+ Pr^ ^ f,M*(i),T.M,(.)(t-l) < X! ^ 



2ViM"« 



Using the union bound we get 



Pr J riM{i),T,MCi){t-i) > Yl (^iMH) + \ < Y Pr|riM(i),T,M(,)(t-i) > OiM{i) + ^1 

Pr{B.(f-l)} 

Now for z e Vm, using Chernoff-Hoeffding bound we get 



Fr{Bi{t)}<Y,e Pr|T,M(i)(t) 

s=l 



fiM{i),s > ^iM(i) + f (30) 

/Km 



Observe that for w > 0, E1=:.+i e"'"* < ^e""'^. This impUes EUo^+i e H v'm J < 2 (^) e H J . 
Let 2/0=2^ E!=i ^s- We have: 

E e H VM J Pr |T,M(i)(t) = s f.MW,. > ^iM(i) + ^| < 2 j e ^ J 
We also have: 



, M 



fiM(i),s > ^iM(i) + 



2V, 



M 



^1 



L?/oJ ^ 
< ^Pr T,M(i)(i) = s 



[vol 



<EP'^{^mw(*)<4 

1 

Pr{T;l(i)W<yo}, 



where denotes the number of times that pair {i,M{i)) is chosen during the exploration phase 

up to time i, and where we used the fact that T^^-^{t) < TiM(i){t) implies PT{TiM(i){t) < yo} < 
{^M(i) W < yo}. Now it can be easily shown that: E[T^,,^^^{t)] = V&v[T^^^.^{t)] = \ = 2yo. 

Using Bernstein's inequality, we have Pr < J/o| < e~^. Finally, using = min(l,d/s), we get 

yo = ^ + ^ log I = ^ log ^. In summary, we proved that: 



Pv{Bi{t)} <yoe-^+2 



Vm_ 



Ivoi 



■ iA 



=h{t)t-To^ +GMt~ 



where = ^ (f ) ^"^ log (f ) and Gm = 2 (§) ) . As a result, we obtain: 



< VMh{t - l){t - 1)- + VmGm • (i - 1) ^ . 

It can be shown similarly that 



Pr< E ^iM*W,T,M*w(t-i) < E 



where Gm* = 2 



< VM*h{t - l){t - 1)-Tra + VM*GM-{t - 1) 

2 , , d / a" V 



'^"('^ ~ 2V^ 
We conclude that: 

Pi-{It = M} < ^1{M eA} + {Vm^ + Vm) (^1 - h{t -l){t- 1)-tSa 

+ VmGm{1- '^^''''> ^Vm^GmAi- -\{t~l) 

The upper bound on regret is then 

t 

j^e- greedy J2 ^^"^ = M} 

s=l M^M* 
t 

s=l M^M* 

+A„axE E {VM + VM^)(l--)ih{s-l){s-l)-^ 



rl\ d ( " 



s=l M^AI* 



+A„axE E Vm^Gm* (l~^){s~l) ^(^y. (31) 

Observe that d > lOAn^/A^^j^ imphes d > 10^1. Note further that Vm < nyu. Then d > 
10AnVA2„i„ imphes that VM 7^ M* 

^>4a(^)' and d>4^fe^' 



As a result, in the R.H.S. of (31), except the first term, the others will be bounded as t grows large. 
Then, after simplifications, we get: 

^e-grcody(^) < dA,„ax log i + 0( 1) , ast-^OO. (32) 

The proof is completed by taking the infimum in the R.H.S. over d > 10 An"^ / A'^^^^. 

□ 